{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Python for Information Retrieval\n",
    "\n",
    "## Solutions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PART A: Start with one document"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Read, Clean, Assign\n",
    "\n",
    "**task**:\n",
    "\n",
    "1. Read one document\n",
    "2. Collect information on the country and year\n",
    "3. Keep the section we're interested in\n",
    "4. Turn each line into an item in a list.\n",
    "\n",
    "**skills**:\n",
    "- file reading\n",
    "- [string](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#string) splicing\n",
    "- string methods\n",
    "- indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 Read in \"cotedivoire2014.txt\"\n",
    "\n",
    "Fill in the blanks to read in the file. We'll need to include the `encoding='utf8'` optional parameter to the `open()` function to ensure that the text file is read correctly on all operating systems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SOLUTION\n",
    "directory = './data/txts'\n",
    "file_name = \"cotedivoire2014.txt\"\n",
    "with open(directory + '/'+ file_name,'r', encoding='utf8') as f:\n",
    "    text = f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Assign country and year variables \n",
    "\n",
    "You'll notice that the file name consists of the name of the country and the year. We can use this to get that information. Slice the file name to create 2 new variables, `country`, and `year`.\n",
    "\n",
    "Be careful! Remember that we are going to apply this to the other file names later. Make sure that however you slice \"cotedivoire2014.txt\" would work for the other files in the `data/txts` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SOLUTION\n",
    "country = file_name[:-8]\n",
    "year = file_name[-8:-4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3 Get the Recommendations Section\n",
    "\n",
    "Note that the section we want starts with `\"II. Conclusions and/or recommendations\\n\"`. What [method](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method) would you use to get everything after this substring? Fill in the blank below and [assign](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#assign) the value to a new variable called `rec_text`.\n",
    "\n",
    "Note: there is certainly more than one way to do this, but the code below suggests one string method in particular. If you have time, think about what other methods or libraries you could use to get certain substrings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SOLUTION\n",
    "sections = text.split(\"II. Conclusions and/or recommendations\\n\")\n",
    "rec_text = sections[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.4 Turn it into a list\n",
    "\n",
    "Using a string method, turn the string above into a list of lines, and store it in a variable called `recs`. Remember that a new line is represented by `\\n`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['127. The recommendations listed below enjoy the support of C™te dÕIvoire: ',\n",
       " '127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); ',\n",
       " '127.2 Make efforts towards the ratification of the OP-CAT (Chile); ',\n",
       " '127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); ',\n",
       " '127.4 Accede to the OP-CAT as soon as possible (Uruguay); ',\n",
       " '127.5 Consider ratifying OP-CAT (Burkina Faso); ',\n",
       " '127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); ',\n",
       " '127.7 Consider acceding to the ICRMW (Chad); ',\n",
       " '127.8 Make efforts towards the ratification of ICCPR-OP 2 (Chile); ',\n",
       " '127.9 Ratify ICCPR-OP 2 (Rwanda) to abolish death penalty (France, Montenegro); ',\n",
       " '127.10 Accede to the Agreement on the Privileges and Immunities of the International Criminal Court (Slovakia); ',\n",
       " '127.11 Sign and ratify the Optional Protocol to the International Covenant on Economic, Social and Cultural Rights (Portugal); ',\n",
       " '127.12 Fully implement CEDAW (Israel); ',\n",
       " '127.13 Ratify the third Optional Protocol to CRC (Portugal); ',\n",
       " '127.14 Sign (Portugal) ratify (France, Portugal, Tunisia) and accede to the International Convention for the Protection of All Persons from Enforced Disappearance as soon as possible (Uruguay); ']"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# SOLUTION\n",
    "recs = rec_text.split(\"\\n\")\n",
    "recs[:15]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.5 Make a function\n",
    "\n",
    "Let's put all of that code into a function that will read in a file and return a list of recommendations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_recommendations(filename):\n",
    "    # read document\n",
    "    with open(directory + '/'+ filename,'r', encoding='utf8') as f:\n",
    "        text = f.read()\n",
    "    \n",
    "    # collect info on country and year\n",
    "    country = filename[:-8]\n",
    "    year = filename[-8:-4]\n",
    "    \n",
    "    # get rec section\n",
    "    sections = text.split(\"II. Conclusions and/or recommendations\\n\")\n",
    "    rec_text = sections[1]\n",
    "    \n",
    "    # turn recs into a list\n",
    "    recs = rec_text.split(\"\\n\")\n",
    "    \n",
    "    return recs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Chunk Recomendations\n",
    "\n",
    "**task**:\n",
    "\n",
    "These texts have 3 sections each. \n",
    "1. The first section contains those recommendations the country supports. \n",
    "2. The second section contains recs the country will examine. \n",
    "3. The third contains recommendations the country explicitely rejects. \n",
    "\n",
    "We want to chunk the the text into three lists, `accept`, `examine`, `reject` -- each containing their respective recommendations.\n",
    "\n",
    "**skills**:\n",
    "- string methods\n",
    "- lists\n",
    "- loops\n",
    "- conditionals\n",
    "- indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1: Find the paragraph numbers\n",
    "\n",
    "Each section starts with a main paragraph number (e.g. **123**). The individual recommendations are then noted as subparagraphs (e.g. **123.1, 123.2** etc.).\n",
    "\n",
    "All the accepted recommendations have the same main paragraph number (**123**). Next come the recommendations which will be examined, whose main paragraph number is just the next integer (**124**). After that are the rejected recommendations, with the next integer as their main paragraph number (**125**).\n",
    "\n",
    "We can't know the paragraph numbers beforehand. But we *can* leverage our knowledge of the structure of the documents to get them.\n",
    "\n",
    "Fill in the blanks below to create 3 variables containing the 3 paragraph numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SOLUTION\n",
    "para1 = recs[0].split(\".\")[0]\n",
    "para1 = int(para1)\n",
    "para2 = para1 + 1\n",
    "para3 = para2 + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Parse the text\n",
    "\n",
    "Now create 3 new lists: `accept`, `examine`, `reject.` Complete the for loop code to filter through `recs` and assign each recommendation to its corresponding section.\n",
    "\n",
    "**hint**: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the **.startswith()** method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# allocate lists for the 3 types of recommendations\n",
    "accept_recs = []\n",
    "examine_recs = []\n",
    "reject_recs = []\n",
    "\n",
    "# iterate through all the recommendations and add each one to the appropriate list\n",
    "for line in recs:\n",
    "    if line.startswith(str(para1)):\n",
    "        accept_recs.append(line)\n",
    "    elif line.startswith(str(para2)):\n",
    "        examine_recs.append(line)\n",
    "    elif line.startswith(str(para3)):\n",
    "        reject_recs.append(line)\n",
    "\n",
    "# remove the first item from each list, which just demarcates the sections\n",
    "accept_recs = accept_recs[1:]\n",
    "examine_recs = examine_recs[1:]\n",
    "reject_recs = reject_recs[1:]    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Make a function\n",
    "\n",
    "Let's again put the code we just created to parse the text into 3 separate lists into a function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_recommendations(recs):\n",
    "    # SOLUTION\n",
    "    para1 = recs[0].split(\".\")[0]\n",
    "    para1 = int(para1)\n",
    "    para2 = para1 + 1\n",
    "    para3 = para2 + 1    \n",
    "\n",
    "    # allocate lists for the 3 types of recommendations\n",
    "    accept_recs = []\n",
    "    examine_recs = []\n",
    "    reject_recs = []\n",
    "\n",
    "    # iterate through all the recommendations and add each one to the appropriate list\n",
    "    for line in recs:\n",
    "        if line.startswith(str(para1)):\n",
    "            accept_recs.append(line)\n",
    "        elif line.startswith(str(para2)):\n",
    "            examine_recs.append(line)\n",
    "        elif line.startswith(str(para3)):\n",
    "            reject_recs.append(line)\n",
    "\n",
    "    # remove the first item from each list, which just demarcates the sections\n",
    "    accept_recs = accept_recs[1:]\n",
    "    examine_recs = examine_recs[1:]\n",
    "    reject_recs = reject_recs[1:]        \n",
    "    \n",
    "    return (accept_recs, examine_recs, reject_recs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Get Recommending Country\n",
    "\n",
    "**skills**\n",
    "\n",
    "- string methods\n",
    "- indexing\n",
    "- functions\n",
    "\n",
    "**task**\n",
    "- extract the substring representing the recommending country."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Extracting the Country\n",
    "\n",
    "Take a look at several recommendations to get an idea of their format. I've given you several samples below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); \n",
      "127.2 Make efforts towards the ratification of the OP-CAT (Chile); \n",
      "127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); \n",
      "127.4 Accede to the OP-CAT as soon as possible (Uruguay); \n",
      "127.5 Consider ratifying OP-CAT (Burkina Faso); \n"
     ]
    }
   ],
   "source": [
    "for cur_rec in accept_recs[:5]: \n",
    "    print(cur_rec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that they're all formatted the same way, with the recommending country in parenthesis at the end, in between parentheses.\n",
    "\n",
    "Using your string skills, find a way to pull out the recommending country from the first recommendation (stored in `first_rec` below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_rec = accept_recs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Philippines\n"
     ]
    }
   ],
   "source": [
    "# SOLUTION\n",
    "rec_after_paran = first_rec.split('(')[-1]\n",
    "first_rec_country = rec_after_paran.split(')')[0]\n",
    "print(first_rec_country)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Create a Function\n",
    "\n",
    "Create a function called `get_country` that passes an individual recommendation and returns the recommending country"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Philippines'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# SOLUTION\n",
    "def get_country(rec):\n",
    "    rec_after_paran = rec.split('(')[-1]\n",
    "    rec_country = rec_after_paran.split(')')[0]\n",
    "    return(rec_country)\n",
    "\n",
    "# test you code\n",
    "get_country(first_rec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Processing all Recommendations\n",
    "\n",
    "**task**:\n",
    "\n",
    "We now want to create a new list for each variable we eventually want in our output csv file. Each list will contain a single value per individual recommendation. The five variables we want a list for are: \n",
    "\n",
    "1. `to`: the country under review\n",
    "2. `from`: the country (or countries) giving the recommendation\n",
    "3. `year`: the year of the review (all 2014 here)\n",
    "4. `decision`: whether the recommendation was supported, rejected, etc.\n",
    "5. `text`: the text of the recommendation\n",
    "\n",
    "To make it easier to store these data (and later to write it out to a csv file), we'll create a dictionary with an empty list for each of these variable names.\n",
    "\n",
    "**skills**:\n",
    "- loops\n",
    "- dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "rec_output = {'to':[],\n",
    "              'from':[],\n",
    "              'year':[],\n",
    "              'decision':[],\n",
    "              'text':[]}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 Process the `accept` Recommendations\n",
    "\n",
    "The code below loops through all the recommentations in the `accept` list and appends an item to each of the 5 lists within the dictionary defined above. Fill in the blanks to complete the code.\n",
    "\n",
    "(Remember we've already created the `country` and `year` variables above!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SOLUTION\n",
    "for rec in accept_recs:\n",
    "    rec_output['to'].append(country)\n",
    "    rec_output['from'].append(get_country(rec))\n",
    "    rec_output['year'].append(year)\n",
    "    rec_output['decision'].append('accept')\n",
    "    rec_output['text'].append(rec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Make a function \n",
    "\n",
    "Now write a function that does the same for any list of recommendations. It should first create an output dictionary and then populate that dictionary. Think about all the parameters that the function should take in order to fill in all 5 fields of the `rec_output` dictionary. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_recs(recs, to_country, year, decision_type):\n",
    "    # Create the output dictionary\n",
    "    output = {'to':[],\n",
    "              'from':[],\n",
    "              'year':[],\n",
    "              'decision':[],\n",
    "              'text':[]}\n",
    "    \n",
    "    # loop over the recommendations and fill the output dictionary's lists\n",
    "    for rec in recs:\n",
    "        output['to'].append(to_country)\n",
    "        output['from'].append(get_country(rec))\n",
    "        output['year'].append(year)\n",
    "        output['decision'].append(decision_type)\n",
    "        output['text'].append(rec)\n",
    "        \n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 4.3 Process all the Recommendations\n",
    "\n",
    "Now use the function that you just wrote to process the recommendations from the `accept` the `examine` and `reject` recommendation lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "# FILL ME OUT\n",
    "output_accept_recs = process_recs(accept_recs, country, year, 'accept')\n",
    "output_examine_recs = process_recs(examine_recs, country, year, 'examine')\n",
    "output_reject_recs = process_recs(reject_recs, country, year, 'reject')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### 4.4 Combine output dictionaries\n",
    " \n",
    "Now let's write a function that takes a list of output recommendation dictionaries and creates a new one that is the combination of all of them. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "def combine_outputs(dicts):\n",
    "    # create a new dictionary to contain the combined values of all the dictionaries\n",
    "    output = {'to':[],\n",
    "              'from':[],\n",
    "              'year':[],\n",
    "              'decision':[],\n",
    "              'text':[]}\n",
    "    \n",
    "    # Loop over all the input dictionaries\n",
    "    for cur_dict in dicts:        \n",
    "        # loop over all the keys in the output dictionary\n",
    "        for cur_key in output.keys():\n",
    "            # extend the list which is the value of the current key using the list from the current dictionary\n",
    "            cur_keys_list = cur_dict[cur_key]\n",
    "            output[cur_key].extend(cur_keys_list)\n",
    "\n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now combine the output dictionaries for the accept, examine, and reject recommendations into a single output dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "186\n",
      "186\n"
     ]
    }
   ],
   "source": [
    "output_recs = combine_outputs([output_accept_recs, output_examine_recs, output_reject_recs])\n",
    "\n",
    "# uncomment to test your code\n",
    "print(len(accept_recs) + len(examine_recs) + len(reject_recs))\n",
    "print(len(output_recs['to']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PART B: Repeat for all documents\n",
    "\n",
    "We just wrote code that takes one document and turns it into a dataset!\n",
    "\n",
    "The problem is we have 11 documents!\n",
    "\n",
    "We'll now combine the code we've written so far to create a function that can read one document at a time, and then read all 11 documents into a single dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Make a function\n",
    "\n",
    "**task**\n",
    "\n",
    "Combine the functions that you wrote above to create a single function that takes a filename as a parameter and returns a dictionary of lists representing all of the recommendations in that document.\n",
    "\n",
    "**skills**\n",
    "- Functions\n",
    "- Copying and pasting :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SOLUTION\n",
    "\n",
    "def process_document(filename):\n",
    "\n",
    "    # Use the function we wrote to read in the recommendations\n",
    "    recs = read_recommendations(filename)\n",
    "    \n",
    "    # Use the function we wrote to parse the recommendations we read in\n",
    "    accept_recs, examine_recs, reject_recs = parse_recommendations(recs)\n",
    "    \n",
    "    # Get the \"to\" country\n",
    "    \n",
    "    country = filename[:-8]\n",
    "\n",
    "    # Use the function to process the three recommendation types\n",
    "    output_accept_recs = process_recs(accept_recs, country, year, 'accept')\n",
    "    output_examine_recs = process_recs(examine_recs, country, year, 'examine')\n",
    "    output_reject_recs = process_recs(reject_recs, country, year, 'reject')\n",
    "\n",
    "    # combine the processed recommendations for the accept, examine and reject types\n",
    "    output_recs = combine_outputs([output_accept_recs, output_examine_recs, output_reject_recs])\n",
    "\n",
    "    return(output_recs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "97\n"
     ]
    }
   ],
   "source": [
    "# test your code!\n",
    "print(len(process_document(\"tuvalu2013.txt\")['to']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Process all of the files\n",
    "\n",
    "**task**\n",
    "\n",
    "1. Find the file_names in our directory.\n",
    "2. Apply the function above to all the filenames\n",
    "3. Create a master dataset\n",
    "\n",
    "**skills**\n",
    "- I/O\n",
    "- Loops\n",
    "- Functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.1 Make a list of file_names\n",
    "\n",
    "The code below reads all the file_names in the directory `data/txts`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sanmarino2014.txt\n",
      "tuvalu2013.txt\n",
      "kazakhstan2014.txt\n",
      "cotedivoire2014.txt\n",
      "fiji2014.txt\n",
      "bangladesh2013.txt\n",
      "turkmenistan2013.txt\n",
      "jordan2013.txt\n",
      "monaco2013.txt\n",
      "afghanistan2014.txt\n",
      "djibouti2013.txt\n"
     ]
    }
   ],
   "source": [
    "# SOLUTION\n",
    "directory = 'data/txts'\n",
    "for file_name in os.listdir(directory):\n",
    "    print(file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modify the code to include only the file_names that end in `.txt`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sanmarino2014.txt\n",
      "tuvalu2013.txt\n",
      "kazakhstan2014.txt\n",
      "cotedivoire2014.txt\n",
      "fiji2014.txt\n",
      "bangladesh2013.txt\n",
      "turkmenistan2013.txt\n",
      "jordan2013.txt\n",
      "monaco2013.txt\n",
      "afghanistan2014.txt\n",
      "djibouti2013.txt\n"
     ]
    }
   ],
   "source": [
    "# SOLUTION\n",
    "for file_name in os.listdir(directory):\n",
    "    if file_name.endswith(\".txt\"):\n",
    "        print(file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6.2 Process all the documents\n",
    "\n",
    "Fill in the blanks below to process all the documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing:  sanmarino2014.txt\n",
      "processing:  tuvalu2013.txt\n",
      "processing:  kazakhstan2014.txt\n",
      "processing:  cotedivoire2014.txt\n",
      "processing:  fiji2014.txt\n",
      "processing:  bangladesh2013.txt\n",
      "processing:  turkmenistan2013.txt\n",
      "processing:  jordan2013.txt\n",
      "processing:  monaco2013.txt\n",
      "processing:  afghanistan2014.txt\n",
      "processing:  djibouti2013.txt\n"
     ]
    }
   ],
   "source": [
    "# SOLUTION\n",
    "output_recs = []\n",
    "for filename in os.listdir(directory):\n",
    "    # Assume all txt files contain meaningful data\n",
    "    if filename.endswith(\".txt\"):\n",
    "        print(\"processing: \", filename)\n",
    "        \n",
    "        # Process the current file using the function\n",
    "        cur_output_recs = process_document(filename)\n",
    "        output_recs.append(cur_output_recs)\n",
    "\n",
    "# Combine the output dictionaries from all of the files we've read in\n",
    "output_recs_final = combine_outputs(output_recs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1709"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Should be 1709\n",
    "len(output_recs_final['to'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6.3 Save to file\n",
    "\n",
    "Now we'll create a `pandas` `DataFrame` around our dataset and write it to a CSV file, and we're done!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "#writing column headings\n",
    "import pandas as pd\n",
    "\n",
    "# create a dataframe using the dictionary we've created\n",
    "output_recs_df = pd.DataFrame(output_recs_final)\n",
    "\n",
    "# write the DataFrame\n",
    "output_recs_df.to_csv('upr-recs.csv')"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
